Provenance in DISC Systems: Reducing Space Overhead at Runtime
Abstract
Data-intensive scalable computing (DISC) systems, such as Apache Hadoop or Spark, make it possible to process large amounts of heterogeneous data. To support a variety of provenance applications, emerging provenance solutions for DISC systems track all source data items through each processing step, which imposes a high space and time overhead during program execution. We introduce a provenance collection approach that reduces the space overhead at runtime by sampling the input data based on the definition of equivalence classes. A preliminary empirical evaluation shows that this approach satisfies many use cases of provenance applications in debugging and data exploration, indicating that collecting provenance for only a fraction of the input data items often suffices for selected provenance applications. For cases where additional provenance is required, we further outline a method to collect provenance at query time that reuses, when possible, partial provenance already collected during program execution.
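The sampling idea in the abstract can be illustrated with a minimal Python sketch. The abstract does not say how equivalence classes are defined or how many items are kept per class, so everything here is an assumption made for illustration: the helper names `sample_by_equivalence_class` and `run_with_sampled_provenance`, the user-supplied `class_of` function, and the "first `per_class` items of each class" selection rule are hypothetical and are not the authors' method.

```python
from collections import defaultdict

def sample_by_equivalence_class(items, class_of, per_class=1):
    """Return the input positions selected for provenance tracking.

    Assumption: two items are equivalent when class_of() returns the same
    value, and the first `per_class` items of each class are tracked.
    """
    seen = defaultdict(int)   # class label -> number of items tracked so far
    tracked = set()
    for idx, item in enumerate(items):
        label = class_of(item)
        if seen[label] < per_class:
            seen[label] += 1
            tracked.add(idx)
    return tracked

def run_with_sampled_provenance(items, class_of, transform, per_class=1):
    """Apply `transform` to every item, recording provenance only for the sample."""
    tracked = sample_by_equivalence_class(items, class_of, per_class)
    outputs = []
    provenance = {}           # output position -> input position
    for idx, item in enumerate(items):
        outputs.append(transform(item))
        if idx in tracked:
            provenance[len(outputs) - 1] = idx
    return outputs, provenance

# Hypothetical input: (request name, HTTP status) pairs, classed by status.
records = [("a", 200), ("b", 500), ("c", 200), ("d", 404), ("e", 500)]
outputs, provenance = run_with_sampled_provenance(
    records,
    class_of=lambda r: r[1],
    transform=lambda r: r[0].upper(),
)
```

In this toy run every record is processed, but provenance is stored for only one representative per status code, so the provenance map grows with the number of classes rather than with the number of input items.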
Similar resources
Using Provenance Patterns to Vet Sensitive Behaviors in Android Apps
We propose Dagger, a lightweight system to dynamically vet sensitive behaviors in Android apps. Dagger avoids costly instrumentation of virtual machines or modifications to the Android kernel. Instead, Dagger reconstructs the program semantics by tracking provenance relationships and observing apps’ runtime interactions with the phone platform. More specifically, Dagger uses three types of low-...
ProTracer: Towards Practical Provenance Tracing by Alternating Between Logging and Tainting
Provenance tracing is a very important approach to Advanced Persistent Threat (APT) attack detection and investigation. Existing techniques either suffer from the dependence explosion problem or have non-trivial space and runtime overhead, which hinder their application in practice. We propose ProTracer, a lightweight provenance tracing system that alternates between system event logging and un...
Retrospective Provenance Without a Runtime Provenance Recorder
The YesWorkflow (YW) toolkit aims to provide users of scripting languages such as Python, Perl, and R with many of the benefits of scientific workflow automation. YW requires neither the use of a workflow engine nor the overhead of adapting or instrumenting code to run in such a system. Instead, YW enables scientists to annotate their scripts with special comments that reveal the main computati...
The Case of the Fake Picasso: Preventing History Forgery with Secure Provenance
As increasing amounts of valuable information are produced and persist digitally, the ability to determine the origin of data becomes important. In science, medicine, commerce, and government, data provenance tracking is essential for rights protection, regulatory compliance, management of intelligence and medical data, and authentication of information as it flows through workplace tasks. In t...
HadoopProv: Towards Provenance as a First Class Citizen in MapReduce
We introduce HadoopProv, a modified version of Hadoop that implements provenance capture and analysis in MapReduce jobs. It is designed to minimise provenance capture overheads by (i) treating provenance tracking in Map and Reduce phases separately, and (ii) deferring construction of the provenance graph to the query stage. Provenance graphs are later joined on matching intermediate keys of the...
Publication date: 2017